Visualizing Data with R

Zhaowen Guo

Agenda

  • ggplot2 and elements of information visualization

  • Visualizing tabular data

    • Information from one variable
    • Information from multiple variables

Graphic systems in R

  • Imperative programming
    • Step-by-step instructions to control the exact construction of output
    • Hands-on, but more work
    • base, grid, tile
  • Declarative programming
    • Allow software to apply a standard solution
    • Customize with a stylesheet
    • ggplot2
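
A minimal sketch of the contrast above, using a small made-up data frame (df, draw_base and p_decl are illustrative names, not part of the original slides). The base-graphics version spells out every drawing step, while the ggplot2 version only declares the mapping and the geometry.

```r
library(ggplot2)

# Small illustrative data frame (values are made up for this sketch)
df <- data.frame(
  x = c(1, 2, 3, 4, 1, 2, 3, 4),
  y = c(2, 4, 3, 6, 1, 2, 4, 5),
  group = rep(c("a", "b"), each = 4),
  stringsAsFactors = FALSE
)

# Imperative (base graphics): every element is drawn by an explicit step
draw_base <- function(d) {
  cols <- c(a = "steelblue", b = "firebrick")
  plot(d$x, d$y, type = "n", xlab = "x", ylab = "y")  # empty canvas with axes
  for (g in unique(d$group)) {                        # one line per group
    sub <- d[d$group == g, ]
    lines(sub$x, sub$y, col = cols[[g]])
  }
  legend("topleft", legend = names(cols), col = cols, lty = 1)  # legend by hand
}

# Declarative (ggplot2): state the mapping and the geometry;
# colors, legend and axes come from defaults
p_decl <- ggplot(df, aes(x = x, y = y, color = group)) +
  geom_line()
```

Calling draw_base(df) or printing p_decl should give a comparable two-line chart; the difference lies in how much the user has to specify.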

What is ggplot2

  • Graphics link data to dimensions of specific aesthetic objects, which are distinguishable by their geometric structure and modifiable in scale and style
  • Users supply the data and request a geometry, leaving the rest to the software
  • Layered grammar of graphics
    • Layers: components of a graph connected by +
    • Aesthetics: specified inside layers to control how each layer appears
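
As a rough sketch of the layered grammar, the plot below is built one "+" at a time from the built-in mtcars data (the object names are illustrative). It also shows the difference between an aesthetic mapped to data inside aes() and one fixed to a constant.

```r
library(ggplot2)

# Data and aesthetic mappings only; no layer has been added yet
base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Each "+" adds one layer on top of the previous ones
with_points <- base + geom_point()
with_trend  <- with_points + geom_smooth(method = "lm", se = FALSE)

# Aesthetics given inside a layer apply to that layer only
styled <- base +
  geom_point(aes(color = factor(cyl)), size = 2) +         # mapped: varies with the data
  geom_smooth(method = "lm", se = FALSE, color = "black")  # fixed: one constant color
```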

Elements of information visualization

Objects

  • Known as geoms in ggplot2; a geom specifies how the data are presented on the plot
  • Points or texts can represent location: geom_point(), geom_text(), geom_label()
  • Lines can represent numerical values and relationships: geom_line(), geom_smooth()
  • Polygons can represent area or size: geom_rect(), geom_bar()
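
A small sketch of the three object families listed above. The cities data frame is made up for illustration; mtcars ships with R.

```r
library(ggplot2)

# Made-up locations for illustration
cities <- data.frame(
  name = c("A", "B", "C"),
  x = c(1, 2, 3),
  y = c(2, 5, 3),
  stringsAsFactors = FALSE
)

# Points and text mark locations
p_points <- ggplot(cities, aes(x = x, y = y)) +
  geom_point() +
  geom_text(aes(label = name), vjust = -1)

# A line connects the observations in order of x
p_line <- ggplot(cities, aes(x = x, y = y)) +
  geom_line()

# Bars are rectangles; by default geom_bar() counts the rows in each category
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()
```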

Aesthetics

  • Colors: use the default colors or a palette (ColorBrewer or R palettes) and specify the fill or color argument

  • Line types: specify the linetype argument with an integer or a character string

  • Dot types: specify the shape argument with an integer or a character

  • Stacked bar plots or histograms: specify the fill argument
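
The aesthetics above can be sketched with built-in data (mtcars from R, economics from ggplot2). The specific palette, shape code and line type are arbitrary choices for illustration.

```r
library(ggplot2)

# Color from a ColorBrewer palette; shape fixed by integer code (17 = filled triangle)
p_shape <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(shape = 17, size = 2) +
  scale_color_brewer(palette = "Set1")

# Line type given by name; the integer 2 is equivalent to "dashed"
p_linetype <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(linetype = "dashed")

# Mapping fill to a second variable stacks the bars by default
p_stacked <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar()
```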

Components

  • Title
  • Legend
  • Annotations
  • Labels
  • Background
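
One way to set each of these components, sketched on the built-in mtcars data. The title, annotation text and legend placement are illustrative choices only.

```r
library(ggplot2)

p_components <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(
    title = "Fuel economy by weight",  # title
    x = "Weight (1000 lbs)",           # axis labels
    y = "Miles per gallon",
    color = "Cylinders"                # legend title
  ) +
  annotate("text", x = 5, y = 30,
           label = "Heavier cars use more fuel") +  # annotation
  theme_bw() +                         # background
  theme(legend.position = "bottom")    # legend placement
```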

Steps to create a ggplot2 graphic

  • Establish a mapping between data variables and plotting dimensions or elements
  • Apply the mapping to one or more standardized aesthetic elements
  • Draw the resulting set of graphical objects

An illustration

How did GDP per capita change over time in Oceanian countries?

options(scipen = 999) # prevent scientific notation such as 1e+05
library(gapminder) # load gapminder dataset from gapminder package
library(ggplot2) # load ggplot2 package for visualization
library(tidyverse) # load tidyverse package for data wrangling
str(gapminder) # examine gapminder dataset
tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
oceania <- gapminder %>%
  filter(continent == "Oceania")

ggplot(oceania, # input the data
  aes(x = year, 
      y = gdpPercap, 
      color = country,
      linetype = country)) + # establish aesthetic mappings
  geom_line(size = 1) + # apply mapping to geom objects
  ggtitle("GDP per capita in Oceanian countries over time") + # add title
  labs(x = "Year", y = "GDP per capita") + # add labels
  theme_bw() # change background to white background with grey gridlines

knitr::kable(oceania)
country      continent  year  lifeExp       pop  gdpPercap
Australia    Oceania    1952   69.120   8691212   10039.60
Australia    Oceania    1957   70.330   9712569   10949.65
Australia    Oceania    1962   70.930  10794968   12217.23
Australia    Oceania    1967   71.100  11872264   14526.12
Australia    Oceania    1972   71.930  13177000   16788.63
Australia    Oceania    1977   73.490  14074100   18334.20
Australia    Oceania    1982   74.740  15184200   19477.01
Australia    Oceania    1987   76.320  16257249   21888.89
Australia    Oceania    1992   77.560  17481977   23424.77
Australia    Oceania    1997   78.830  18565243   26997.94
Australia    Oceania    2002   80.370  19546792   30687.75
Australia    Oceania    2007   81.235  20434176   34435.37
New Zealand  Oceania    1952   69.390   1994794   10556.58
New Zealand  Oceania    1957   70.260   2229407   12247.40
New Zealand  Oceania    1962   71.240   2488550   13175.68
New Zealand  Oceania    1967   71.520   2728150   14463.92
New Zealand  Oceania    1972   71.890   2929100   16046.04
New Zealand  Oceania    1977   72.220   3164900   16233.72
New Zealand  Oceania    1982   73.840   3210650   17632.41
New Zealand  Oceania    1987   74.320   3317166   19007.19
New Zealand  Oceania    1992   76.330   3437674   18363.32
New Zealand  Oceania    1997   77.550   3676187   21050.41
New Zealand  Oceania    2002   79.110   3908037   23189.80
New Zealand  Oceania    2007   80.204   4115771   25185.01

Information from one variable

# show frequencies of a variable
gapminder %>%
  filter(year == 1952) %>%
  ggplot(aes(x = lifeExp)) + 
  geom_histogram(binwidth = 2) +
  theme_light() +
  labs(x = "Life Expectancy", y = "Count", title = "Life Expectancy in 1952")

# show the estimated density of a variable
gapminder %>%
  filter(year == 1952) %>%
  ggplot(aes(x = lifeExp)) + 
  geom_density(size = 1.5, alpha = 0.2, fill = "red") +
  theme_light() +
  labs(x = "Life Expectancy", y = "Density", title = "Life Expectancy in 1952")

# show distribution of a variable (median, 1st and 3rd quartiles, outliers)
gapminder %>% filter(year == 1952, continent=="Europe") %>%
  ggplot(aes(y = lifeExp)) + 
  geom_boxplot(fill = "grey", color = "blue", outlier.shape = 1) + # adjust aesthetics
  theme_light() +
  labs(title = "Life Expectancy in 1952 (Europe)", 
       y = "Life Expectancy", 
       x = "") 

# show distribution of a discrete variable 
ggplot(gapminder, 
       aes(x = continent,
           fill = continent)) + # differentiate the filled colors
  geom_bar() + 
  theme_classic() + 
  labs(y = "Number of observations (country-years)", x = "Continent") # each country appears once per survey year

americas <- gapminder %>% 
  filter(year == 2007 & continent == "Americas") %>% 
  arrange(gdpPercap) %>% 
  mutate(country = factor(country, levels = country))

ggplot(americas, aes(x = gdpPercap, y = country)) +
  geom_segment(aes(x = 0, xend = gdpPercap, 
                   y = country, yend = country), # country is a factor ordered by gdpPercap (set above), which fixes the row order
               color = "black") + 
  geom_point(colour = "blue", size = 2, alpha = 0.8) +
  scale_x_continuous(expand = c(0, 0), 
                     limits = c(0, max(americas$gdpPercap) * 1.1),
                     labels = scales::dollar) + 
  theme_bw()

Information from multiple variables

gapminder %>%
  ggplot(aes(x = lifeExp)) + 
  geom_histogram() + 
  facet_wrap(~ continent, ncol = 3) +
  labs(title = "Life expectancy distribution by continent") +
  theme_minimal()

gapminder %>%
  filter(year == 2007 & continent != "Oceania") %>%
  ggplot(aes(x = lifeExp)) + 
  geom_density(aes(fill = continent), size = 0.1, alpha = 0.5) + 
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Life expectancy distribution in 2007") +
  theme_minimal()

gapminder %>% 
  filter(year > 1990) %>%
  group_by(year, continent) %>%
  summarise(totalpop = sum(as.double(pop))) %>%
  ggplot(aes(x = year, y = totalpop, fill = continent)) + 
  geom_col(position = "dodge", size = 0.2, alpha = 0.8) + # dodge overlapping objects side by side 
  scale_x_continuous(breaks = seq(1992, 2007, 5), expand = c(0, 0)) +
  scale_y_continuous(labels = scales::comma, expand = c(0, 0)) +
  scale_fill_brewer(palette = "Set1") +
  theme_bw()

Practice

Exercise 1: Make a scatter plot with average GDP per capita across all countries on the y-axis and year on the x-axis.

Exercise 2: Break down the plot from exercise 1 by continent, using colors to distinguish the points and putting mean GDP per capita on a log scale.

Exercise 3: Make a collection of bar plots faceted by year that compare mean GDP per capita across continents in a given year. Orient the plots to make it easier to read the continent labels.

Exercise 4: What is the relationship between life expectancy and GDP per capita in 2007 within each continent other than Oceania?

Solutions

gapminder %>%
  group_by(year) %>%
  summarize(meanGDPpc = mean(gdpPercap)) %>%
  ggplot(aes(x = year, y = meanGDPpc)) +
  geom_point()

gapminder %>%
  group_by(year, continent) %>% # aggregate the information by year by continent
  summarize(meanGDPpc = mean(gdpPercap)) %>%
  ggplot(aes(x = year, y = meanGDPpc, color = continent)) +
  geom_point() +
  scale_y_log10() # apply the log scale to GDP per capita

gapminder %>%
  group_by(year, continent) %>%
  summarize(meanGDPpc = mean(gdpPercap)) %>%
  ggplot(aes(x = continent, y = meanGDPpc)) +
  geom_col() +
  facet_wrap(~ year) +
  coord_flip() # flip the coordinates so that the continent names are visible

gapminder %>%
  filter(year == 2007 & continent != "Oceania") %>% 
  ggplot(aes(x = log(gdpPercap),
             y = lifeExp,
             color = continent)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") + 
  facet_wrap(~ continent)

Bonus: animations!

Check out the plotly package

library(plotly)
animate_gapminder <- ggplot(gapminder, aes(gdpPercap, lifeExp, color = continent)) +
  geom_point(aes(size = pop, ids = country, frame = year)) +
  geom_smooth(se = FALSE, method = "lm") +
  scale_x_log10() + 
  theme_bw() + 
  labs(x = "GDP per capita", y = "Life expectancy") +
  theme(legend.position = "none") # remove legend

ggplotly(animate_gapminder) %>% 
  highlight("plotly_hover") %>%
  animation_slider(
    currentvalue = list(prefix = "Year ", font = list(color="black"))
  )